library(readxl)
library(reticulate)
library(tidyverse)
library(tidymodels)
library(tidytext)
library(text2vec)
library(embed)
library(umap)
library(uwot)
library(plotly)
library(GGally)
library(textstem)
library(SnowballC)
library(forcats)
library(tm)
library(dbscan)
library(factoextra)
library(cluster)
library(e1071)
library(quanteda)
library(here)
news_paper_data <- read_excel("C:/Users/LATITUDE 5520/Documents/Portfolio/Clustering_of_News_Articles/data/newspaper_data_raw.xlsx")

news_paper_data %>%
  head(n = 2) %>%
  DT::datatable(filter = "top")
Introduction
The South African Institution of Civil Engineering (SAICE) published an Infrastructure Report Card (IRC) for South Africa in which the state of South Africa's infrastructure is discussed. The report covers the following infrastructure: water, sanitation, solid waste management, roads, airports, ports, oil and gas pipelines, rail, electricity, healthcare, fire, education, and information and communication technology.
To accurately assess South Africa's infrastructure, SAICE requires data. However, data for each infrastructure category is not always available or is incomplete. To improve the accuracy of the infrastructure evaluation, SAICE is interested in using data from online news articles. To this end, SAICE has collected 9000 online articles in an Excel file.
Each row of the Excel file represents one news article and contains:
the article id,
the article title,
the article subtitle and
the article text.
Libraries and data loading
Name | news_paper_data |
Number of rows | 9000 |
Number of columns | 5 |
Column type frequency: | |
character | 2 |
logical | 1 |
numeric | 2 |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
title | 2 | 1.00 | 3 | 143 | 0 | 8943 | 0 |
article | 184 | 0.98 | 11 | 32971 | 0 | 8796 | 0 |
Variable type: logical
skim_variable | n_missing | complete_rate | mean | count |
---|---|---|---|---|
subtitle | 9000 | 0 | NaN | : |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
id | 0 | 1 | 13205.26 | 7650.51 | 2 | 5991.50 | 13408.5 | 20445.75 | 26134 | ▇▆▆▆▇ |
…2 | 0 | 1 | 4500.50 | 2598.22 | 1 | 2250.75 | 4500.5 | 6750.25 | 9000 | ▇▇▇▇▇ |
We seem to have 5 variables instead of the 4 intended, so we will remove the additional one.
We also see that the article variable contains 184 missing values and the title variable has 2 missing values, while the subtitle column is empty everywhere.
We essentially have only 2 variables of importance, article and title, as the id is not informative.
Data Cleaning
The number of articles with missing values is: 184
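The cleaning chunk itself is not shown above; a minimal sketch of what it could look like, assuming the extra column is the ...2 index created on import and that the word count is stored in article_length (both names are assumptions based on the skim output and the later code), is:

# Sketch of the cleaning step (assumed, not the original chunk):
# drop the redundant index column and the empty subtitle, remove articles
# with missing text, and record a rough word count per article.
news_paper_df <- news_paper_data %>%
  select(-`...2`, -subtitle) %>%
  filter(!is.na(article)) %>%
  mutate(article_length = str_count(article, "\\S+"))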
Text transformation
As we are dealing with text, we need to transform our data into a usable format by concatenating the rows of interest, removing any unwanted symbols or numbers, extracting tokens, and much more depending on the need.
corpus_df <- news_paper_df %>%
  unnest_tokens(word, article,
                token = "regex",
                pattern = "[^A-Za-z]+",
                to_lower = FALSE)

# Create the vocabulary of the articles
vocabulary <- corpus_df %>%
  select(word) %>%
  unique()

# Print the respective values
cat("The corpus contains", length(corpus_df$word), "tokens\n",
    "While the vocabulary has", length(vocabulary$word), "unique tokens")
The corpus contains 3340566 tokens
While the vocabulary has 76613 unique tokens
Let us see the distribution of the most occurring words below:
top_20_tokens <- corpus_df %>%
  select(word) %>%
  count(word, name = "token_count") %>%
  arrange(desc(token_count)) %>%
  slice(1:20)

top_20_tokens %>%
  mutate(word = fct_reorder(word, -token_count)) %>%
  ggplot(aes(x = word, y = token_count)) +
  geom_bar(stat = "identity") +
  labs(
    title = "Histogram of most frequent tokens (words)",
    x = "word",
    y = "count"
  ) +
  theme_minimal()
Most of these are stop words (to, a, on, …), and the tokens are also case-sensitive.
Let us now compare the most common words after removing the stop words:
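The comparison chunk is not shown; a minimal sketch, reusing corpus_df and assuming tidytext's stop_words list, could look like this:

# Sketch: top 20 tokens after removing English stop words (case-folded first so
# that "The" and "the" are both matched); the choice of tidytext::stop_words is
# an assumption.
corpus_df %>%
  mutate(word = str_to_lower(word)) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, name = "token_count", sort = TRUE) %>%
  slice(1:20) %>%
  mutate(word = fct_reorder(word, -token_count)) %>%
  ggplot(aes(x = word, y = token_count)) +
  geom_bar(stat = "identity") +
  labs(title = "Most frequent tokens after stop word removal",
       x = "word", y = "count") +
  theme_minimal()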
Text normalisation
Let us perform some text normalisation by applying techniques such as case folding and stemming, and see how that influences our vocabulary size:
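The chunk behind the numbers below is not shown; a minimal sketch, assuming case folding via str_to_lower() and stemming via SnowballC::wordStem(), could be:

# Sketch: vocabulary sizes after case folding, stemming, and both combined.
# The use of SnowballC::wordStem() for stemming is an assumption.
vocab_case <- corpus_df %>% mutate(word = str_to_lower(word)) %>% distinct(word)
vocab_stem <- corpus_df %>% mutate(word = wordStem(word)) %>% distinct(word)
vocab_both <- corpus_df %>% mutate(word = wordStem(str_to_lower(word))) %>% distinct(word)

cat("The size of the vocabulary after case folding:", nrow(vocab_case), "\n",
    "The size of the vocabulary after stemming:", nrow(vocab_stem), "\n",
    "The size of the vocabulary after case folding and stemming:", nrow(vocab_both))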
The size of the vocabulary after case folding: 63770
The size of the vocabulary after stemming: 57246
The size of the vocabulary after case folding and stemming: 46696
Data quality transformation
Could we have some articles that are way shorter than others?
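The length-distribution plot discussed below is not reproduced here; a minimal sketch, assuming the article_length column from the cleaning step, could be:

# Sketch: distribution of article lengths (in words), using the article_length
# word count computed during cleaning.
news_paper_df %>%
  ggplot(aes(x = article_length)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of article lengths",
       x = "number of words per article", y = "number of articles") +
  theme_minimal()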
The above plot shows a strongly skewed distribution of article lengths. For this reason, using a count vectoriser for modelling might not be the best option.
Instead, we should use a term frequency-inverse document frequency (tf-idf) vectoriser.
Let us remove articles that have fewer than 100 words:
news_paper_df %>%
  filter(article_length < 100) %>%
  count(name = "less than 100 words") %>%
  DT::datatable()

news_paper_df <- news_paper_df %>%
  filter(article_length >= 100)
Text embedding
As most algorithms cannot work with text directly, we will transform the articles using the following steps:
lemmatization;
stop word removal;
removing words with fewer than 2 letters;
number removal;
tf-idf weighting;
removal of words appearing in fewer than 5 articles, which is different from words appearing only 5 times;
removing words that appear fewer than 5 times in the corpus.
embedding <-
  news_paper_df %>%
  unnest_tokens(word, article,
                token = "regex",
                pattern = "[^A-Za-z]+") %>%
  filter(nchar(word) >= 2) %>%
  anti_join(get_stopwords()) %>%
  add_count(word, sort = TRUE) %>%
  group_by(word) %>%
  filter(sum(n) >= 5) %>%
  ungroup() %>%
  filter(n >= 5) %>%
  select(-n) %>%
  mutate(word = lemmatize_words(word)) %>%
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n)
## Let us extract the column names:
vocab <-
  embedding %>%
  select(word) %>%
  distinct() %>%
  pull(word)
## Embedding tibble
embedding_tbl <- embedding %>%
  select(-c(tf, idf, n)) %>%
  pivot_wider(names_from = word, values_from = tf_idf,
              values_fill = 0, names_repair = "unique") %>%
  rename(id = id...1)

embedding_tbl
Each row here represents a single article and each column a unique token (word).
Let us analyse the tf-idf of the 873rd document, for example, and see what it is about:
First, we look at the article's content below:
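A minimal sketch to print this article, assuming the 873rd row of news_paper_df corresponds to the article with id 4038 shown in the tf-idf table below, could be:

# Sketch: print the text of the 873rd article (assumed to correspond to id 4038).
news_paper_df %>%
  slice(873) %>%
  pull(article)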
[1] "The NDK0 11kV Oil Circuit Breaker at the Nivensdrift substation tripped, affecting Nivensdrift, Kruisrivier, and the surrounding areas. Staff attending to restoration of supply, no timeframe available. We apologise for the inconvenience caused."
The article appears to be about electricity.
Below are the top 10 words of the article with the highest tf-idf scores:
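A minimal sketch of this query on the embedding tibble built above could be:

# Sketch: top 10 words by tf-idf for the article with id 4038
# (assumed to be the 873rd article).
embedding %>%
  filter(id == 4038) %>%
  arrange(desc(tf_idf)) %>%
  slice(1:10) %>%
  select(id, word, n, tf_idf)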
# A tibble: 10 × 4
id word n tf_idf
<dbl> <chr> <int> <dbl>
1 4038 breaker 1 0.321
2 4038 timeframe 1 0.312
3 4038 kv 1 0.307
4 4038 restoration 1 0.264
5 4038 circuit 1 0.250
6 4038 oil 1 0.221
7 4038 apologise 1 0.219
8 4038 substation 1 0.215
9 4038 inconvenience 1 0.210
10 4038 trip 1 0.187
Which document is most similar to the 873rd document according to the tf-idf embeddings?
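The similarity chunk is not shown; the similarity[,1] column in the output suggests a matrix result, so a minimal sketch using cosine similarity via text2vec::sim2() (the choice of sim2() here is an assumption) could be:

# Sketch: cosine similarity between article id 4038 and all other articles,
# based on the tf-idf embedding; the use of text2vec::sim2() is an assumption.
tfidf_mat <- as.matrix(embedding_tbl[, -1])
doc_row <- which(embedding_tbl$id == 4038)

similarity <- sim2(tfidf_mat, tfidf_mat[doc_row, , drop = FALSE], method = "cosine")

tibble(id = embedding_tbl$id, similarity = similarity) %>%
  arrange(desc(similarity[, 1])) %>%
  slice(1:5)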
# A tibble: 5 × 2
id similarity[,1]
<dbl> <dbl>
1 4038 1
2 3992 0.645
3 4047 0.574
4 1270 0.335
5 18491 0.239
Below are the top 10 words of the article (id = 3992), which is the most similar to the 873rd article based on cosine similarity:
# A tibble: 10 × 4
id word n tf_idf
<dbl> <chr> <int> <dbl>
1 3992 timeframe 1 0.330
2 3992 restoration 1 0.279
3 3992 apologise 1 0.231
4 3992 inconvenience 1 0.223
5 3992 outage 1 0.196
6 3992 surround 1 0.190
7 3992 section 1 0.175
8 3992 party 1 0.161
9 3992 deal 1 0.149
10 3992 attend 1 0.148
Dimensionality reduction
Most clustering algorithms become slow in the presence of high-dimensional data. To alleviate this, we perform dimensionality reduction using the UMAP algorithm, reducing the data to only 4 components:
umap <- uwot::umap(embedding_tbl[,-1], n_components = 4, seed = 2024)
# A tibble: 8,793 × 4
V1 V2 V3 V4
<dbl> <dbl> <dbl> <dbl>
1 -0.0964 1.09 -1.36 -1.33
2 -0.562 0.968 -1.38 -0.165
3 -0.145 0.705 -0.632 -0.190
4 -0.171 1.10 -0.816 0.0783
5 -0.0239 0.178 0.0301 -0.258
6 0.0837 1.11 -0.912 -0.285
7 0.379 1.47 -1.55 -0.841
8 -0.271 1.33 -0.483 0.0136
9 -0.157 1.28 -1.17 -0.504
10 -0.432 1.21 -1.29 -0.240
# ℹ 8,783 more rows
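The scatter plot behind the observations below is not reproduced here; a minimal sketch plotting the first two UMAP components could be:

# Sketch: scatter plot of the first two UMAP components; umap is the matrix
# returned by uwot::umap() above (V1..V4 names come from as.data.frame()).
umap_tbl <- umap %>%
  as.data.frame() %>%
  as_tibble()

umap_tbl %>%
  ggplot(aes(x = V1, y = V2)) +
  geom_point(alpha = 0.3, size = 0.8) +
  labs(title = "UMAP projection of the tf-idf embeddings",
       x = "UMAP 1", y = "UMAP 2") +
  theme_minimal()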
Voilà!
We notice one big cluster as well as some smaller clusters, which could indicate groups of articles that are similar to one another and dissimilar to most of the other articles. These small clusters could also be outliers.
Besides UMAP, we could also try other techniques such as SOM, or a clustering algorithm such as DBSCAN, and see whether the clusters become more visible.
For now, thank you for your attention and see you soon. 🖖